Automate corpora testing in CI #4927
Conversation
…al runs to get PR issue number
The bench uses --no-verification, so the engine's overlap-path dedup (which exists to protect verifiers from duplicate calls) adds noise without value here — it causes shifts in unrelated detectors when only one detector's regex changes. Pair --allow-verification-overlap with --no-verification so each detector's regex behavior is measured independently. Also fix the false 'no diff vs main' claim that triggered when NEW/REMOVED were zero but total counts differed.
awk's END block doesn't run when trufflehog exits before draining stdin (SIGPIPE kills awk first), leaving the bytes file empty and breaking the step with a `$((TOTAL_BYTES + ))` syntax error. Read the file with a default of 0 and validate it's an integer before arithmetic. Also fold unzstd/jq stderr into STDERR_FILE so benign Broken pipe notices stay out of CI logs. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
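A minimal sketch of that defensive read, assuming the per-file byte count lands in a file like bytes.txt (names here are illustrative, not the workflow's actual variables):

```bash
# Read the byte count with a default of 0: if awk died on SIGPIPE before
# its END block ran, the file is empty or missing, and naive interpolation
# would produce the `$((TOTAL_BYTES + ))` syntax error described above.
TOTAL_BYTES="${TOTAL_BYTES:-0}"
FILE_BYTES="$(cat bytes.txt 2>/dev/null || true)"
FILE_BYTES="${FILE_BYTES:-0}"

# Validate it's an integer before using it in arithmetic.
if ! [[ "$FILE_BYTES" =~ ^[0-9]+$ ]]; then
  FILE_BYTES=0
fi
TOTAL_BYTES=$((TOTAL_BYTES + FILE_BYTES))
```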
…pection
Static AST parse of a detector package to extract the strings returned by its Keywords() method. Used by the upcoming keyword-corpus builder to fan out per-detector GitHub Code Search queries during the corpora bench. AST-first because each detector lives in its own package; importing them dynamically would require codegen or `plugin`. Falls back to a regex over the function body, then a directory-wide grep, when AST resolution can't statically resolve the return value (helper calls, build-tagged variants). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Corpora Test Results
No detector regex or keyword changes in this PR. Bench skipped.
…urity/trufflehog into hackathon/detector-tests-in-ci
SUM(CASE WHEN Verified AND VerificationError IS NULL THEN 1 ELSE 0 END) verified,
SUM(CASE WHEN NOT Verified AND VerificationError IS NULL THEN 1 ELSE 0 END) unverified,
SUM(CASE WHEN VerificationError IS NOT NULL THEN 1 ELSE 0 END) \"unknown\"
Running with --no-verification above makes these values not meaningful, right?
Agreed! Thanks for catching this. Will remove
for CORPORA_FILE in "$@"; do
  if [[ "$CORPORA_FILE" == s3://* ]]; then
    aws s3 cp "$CORPORA_FILE" - | scan /dev/stdin
Note: though stdin is likely fine for this kind of testing, the newer json-enumerator input source in TruffleHog might make this a bit more straightforward. That input source expects NDJSON where each value looks like this:
{"data": "utf8 string to scan", "metadata": <arbitrary JSON value>}
or this:
{"data_b64": "base64-encoded bytestring to scan", "metadata": <arbitrary JSON value>}
Hmm, json-enumerator would give better chunk isolation and carry path metadata through. But in this context we're running with --no-verification and the metadata isn't used downstream, so the practical difference is minimal. Keeping stdin for now to avoid any unexpected behavioral changes, but happy to revisit if there's a specific benefit you had in mind.
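For reference, a sketch of what the reshaping could look like if we ever switch, reusing the corpus's .content field; the metadata payload is purely illustrative:

```bash
# Emit the NDJSON shape json-enumerator expects
# ({"data": ..., "metadata": <arbitrary JSON>}), one object per line.
unzstd -c "$CORPORA_FILE" \
  | jq -c --arg src "$CORPORA_FILE" '{data: .content, metadata: {source: $src}}'
```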
local rc=0
if [[ -n "${TRUFFLEHOG_BIN_MAIN:-}" ]]; then
  # Single S3 download teed to both binaries simultaneously.
  unzstd -c "$input" 2>> "$STDERR_FILE" \
    | jq -r .content 2>> "$STDERR_FILE" \
    | tee >(
        "${TRUFFLEHOG_BIN_MAIN}" \
          --no-update \
          --no-verification \
          --allow-verification-overlap \
          --log-level=3 \
          --concurrency=8 \
          --json \
          "${main_include_flag[@]}" \
          stdin >> "${OUTPUT_JSONL_MAIN}" 2>> "$STDERR_FILE"
      ) \
    | "$TRUFFLEHOG_BIN" \
        --no-update \
        --no-verification \
        --allow-verification-overlap \
        --log-level=3 \
        --concurrency=8 \
        --json \
        --print-avg-detector-time \
        "${INCLUDE_FLAG[@]}" \
        stdin >> "$OUTPUT_JSONL" 2>> "$STDERR_FILE"
  rc=$?
  wait
else
  unzstd -c "$input" 2>> "$STDERR_FILE" \
    | jq -r .content 2>> "$STDERR_FILE" \
    | "$TRUFFLEHOG_BIN" \
        --no-update \
        --no-verification \
        --allow-verification-overlap \
        --log-level=3 \
        --concurrency=8 \
        --json \
        --print-avg-detector-time \
        "${INCLUDE_FLAG[@]}" \
        stdin >> "$OUTPUT_JSONL" 2>> "$STDERR_FILE"
  rc=$?
fi
When I last used trufflehog stdin like this, I found that it would timeout after 1 minute, leaving you with an incomplete scan when dealing with large inputs. The workaround was, confusingly, to specify --archive-timeout=6h (or some similarly large value).
You might want to check that the scan is not being terminated early! This is another reason why you might prefer the json-enumerator input source, which doesn't have this timeout gotcha.
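For the record, the workaround described above looks like this (a sketch; the input redirection is illustrative):

```bash
# Raise the archive timeout so a long-running stdin scan isn't
# terminated after the default one minute.
trufflehog --no-update --no-verification --archive-timeout=6h --json stdin < corpus.txt
```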
I don't think this is an issue here. The scans consistently run for 30+ minutes and complete without truncation. I think it's because we decompress the corpus ourselves with unzstd before piping to stdin, so TruffleHog never sees a compressed archive and the archive timeout never applies.
s3://trufflehog-corpora-datasets/contents.2025-11-04.jsonl.zstd
s3://trufflehog-corpora-datasets/contents.jsonl.zstd
jobs:
Note, some of the actions used here are old versions. Also, you might consider pinning the action versions used here to reduce risk of possible supply-chain attacks.
zizmor is helpful: https://docs.zizmor.sh/
Thanks for this. Really helpful! I'll do the needful
pull_request:
  paths:
    - 'pkg/detectors/**'
    - 'pkg/engine/defaults/defaults.go'
    - '.github/workflows/detector-corpora-test.yml'
    - 'scripts/test/detector_corpora_test.sh'
    - 'scripts/test/diff_corpora_results.py'
    - 'scripts/test/detect_changed_detectors.sh'
Since the CPU work required by a single run of this workflow is pretty expensive (30+ minutes), is this something we want running automatically on pull requests (as it looks like it does here), or only as an opt-in workflow?
Good question. The workflow is scoped to only trigger when a PR modifies regex patterns or Keywords() in a detector. Purely structural changes (verification logic, redaction, comments, etc.) are filtered out and skip the bench entirely. In practice this means it only runs on PRs that actually affect match behavior.
We also don't currently merge a detector without running this test manually, so automating it in CI seems like the right call: it ensures the check never gets skipped and gives reviewers the data they need without having to ask for it.
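The filter is roughly this shape (a sketch, not the actual detect_changed_detectors.sh; the grep patterns are illustrative):

```bash
# List detector packages whose diff against the merge base touches a
# regex definition or the Keywords() method; everything else is skipped.
base="$(git merge-base origin/main HEAD)"
for dir in $(git diff --name-only "$base" -- pkg/detectors/ \
               | xargs -r -n1 dirname | sort -u); do
  if git diff -U0 "$base" -- "$dir" \
       | grep -E '^[+-]' \
       | grep -Eq 'regexp\.MustCompile|func \(.*\) Keywords\('; then
    echo "$dir"
  fi
done
```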
…e everything works as expected
Cursor Bugbot has reviewed your changes and found 1 potential issue.
| "${INCLUDE_FLAG[@]}" \ | ||
| stdin >> "$OUTPUT_JSONL" 2>> "$STDERR_FILE" | ||
| rc=$? | ||
| wait |
Main binary failure in process substitution silently ignored
Low Severity
In dual-binary mode, the main binary runs inside a tee >(...) process substitution whose exit code is never checked. If it crashes or exits early, wait runs under set +e so the failure is silently swallowed. Worse, if tee receives SIGPIPE from the broken process substitution pipe, it may also die, truncating input to the PR binary — producing silently incomplete results on both sides with no error surfaced.
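One way to surface that status, as a sketch (produce_corpus, main_scan, and pr_scan stand in for the real pipeline stages):

```bash
# Persist the main binary's exit status from inside the process
# substitution; bash exposes it through neither $? nor plain wait.
MAIN_RC_FILE="$(mktemp)"
produce_corpus \
  | tee >(
      main_scan >> main.jsonl 2>> stderr.log
      echo "$?" > "$MAIN_RC_FILE"
    ) \
  | pr_scan >> pr.jsonl 2>> stderr.log
rc=$?
wait  # give the process substitution time to write its status file

main_rc="$(cat "$MAIN_RC_FILE" 2>/dev/null)"
if [[ "${main_rc:-1}" -ne 0 ]]; then
  echo "main-binary scan failed (rc=${main_rc:-unknown})" >&2
  rc=1
fi
```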


Motivation
When adding or modifying a detector, the key question is: how much noise will this regex produce against real-world code? Too many false positives means alert fatigue; a regex that's too tight misses real secrets.
Previously this was a fully manual process — download a large corpus locally, run the pipeline, inspect the DuckDB output. There was no enforcement in CI, so it was easy to skip or forget, especially under time pressure.
This PR automates it. Every PR that touches detector code now gets a comment showing exactly how many unique matches the changed detector produces, compared to the main baseline. The comment updates in place on every push, so the PR timeline stays clean.
What it tells you
The bench scans a corpus of real-world public code (S3 datasets, ~35 GB in total compressed) using only the detectors changed in the PR, with verification disabled. It reports unique match counts for the PR build vs. the main baseline:
PRs that add a new detector will see a 🆕 row with an absolute match count and no baseline comparison.
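Concretely, each side of the bench boils down to an invocation of this shape (detector name illustrative; flags match the script excerpt above):

```bash
# Scoped, verification-free scan: only the changed detector runs, and
# overlap dedup is disabled so its matches are counted independently.
unzstd -c corpus.jsonl.zstd \
  | jq -r .content \
  | trufflehog --no-update --no-verification --allow-verification-overlap \
      --json --include-detectors="somedetector" stdin > results.jsonl
```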
Example output
What changed
- .github/workflows/detector-corpora-test.yml — new workflow; triggers on PRs touching pkg/detectors/**; detects which detectors changed, builds PR and main binaries in parallel, runs both scans against the corpus concurrently, posts a sticky comment with the diff
- scripts/test/detect_changed_detectors.sh — resolves changed detector directories to their proto enum names for --include-detectors scoping; skips detectors whose diff doesn't touch regex patterns or Keywords(), so PRs that only change verification or redaction logic don't trigger a bench run
- scripts/test/detector_corpora_test.sh — streams corpus files (S3 or local), runs trufflehog with --no-verification, outputs JSONL
- scripts/test/diff_corpora_results.py — diffs two JSONL result sets and renders the Markdown report posted to the PR

Performance
The naive implementation was slow. Two full scans (PR + main) each streamed the entire corpus independently from S3 — on a 35 GB dataset that meant two downloads, two decompressions, two jq passes, and two trufflehog runs, serialized per dataset file. First runs took ~50 minutes.
Three optimizations brought first runs down to ~38 minutes and subsequent pushes to the same PR down to ~30 minutes:
1. Main scan caching — The main binary scans the same commit (the merge base) on every push to a PR, as long as no rebase happens. We cache /tmp/results-main.jsonl in GitHub Actions keyed by merge-base SHA + scoped detector set. On subsequent pushes without a rebase, the entire main side (worktree checkout, go build, S3 download, scan) is skipped entirely.
2. Single S3 stream with tee — When the main scan does need to run (cache miss), both PR and main binaries now consume the same S3 stream via tee >(main_binary stdin). S3 is downloaded and decompressed once; both scans process the content in parallel at the OS pipe level.
3. Scoped scanning — Only the detectors changed in the PR are passed via --include-detectors. Scanning the full detector set against a large corpus is most of the work; scoping to 1–3 changed detectors cuts runtime proportionally.

Fork PRs
Excluded for fork PRs — S3 credentials are not available to fork-originated workflows. Maintainers need to run this manually for forked PRs.
Running locally
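Assuming a locally built binary and AWS credentials, a run looks roughly like this (the environment variable and arguments mirror the script; exact knobs may differ):

```bash
# Build the scanner, then point the bench script at one or more corpora
# (local paths or s3:// URLs, per the argument loop in the script).
go build -o trufflehog .
TRUFFLEHOG_BIN=./trufflehog \
  scripts/test/detector_corpora_test.sh \
  s3://trufflehog-corpora-datasets/contents.jsonl.zstd
```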
Results land in /tmp/corpora_results.jsonl with a DuckDB summary table printed to stdout.

Note
Medium Risk
Adds a new CI workflow that pulls large corpora from S3 using repository secrets and posts sticky PR comments, so failures/mis-scoping or credential issues can impact PR checks and leak operational signals (though no product runtime paths change).
Overview
Adds an automated detector corpora regression benchmark in CI that runs on PRs touching detector code and posts a sticky <!-- detector-bench --> PR comment with unique-match deltas (PR vs merge-base main) for the impacted detectors.

Introduces scripts to (1) detect which detectors actually changed matching behavior (regex/Keywords()), (2) stream one or more zstd JSONL corpora files (local or s3://) through trufflehog with --include-detectors and --no-verification, and (3) diff the two JSONL outputs into a Markdown report (including handling for new detectors with no baseline).

Optimizes CI runtime by caching the baseline scan keyed by merge-base SHA + detector set, building main and PR binaries in parallel, and teeing a single corpus stream to both binaries when a baseline scan is needed; also updates .gitignore to ignore Python bytecode artifacts.